feat: bot detection logic#170

Merged
tkotthakota-adobe merged 87 commits into main from SITES-37727
Feb 13, 2026

@tkotthakota-adobe tkotthakota-adobe commented Dec 23, 2025

Implemented comprehensive bot protection detection and alerting in the Task Processor to identify when sites are blocked by bot protection services (Cloudflare, Akamai, Imperva, etc.) and prevent unnecessary processing.

Tests:
DynamoDB records were created and fetched using the scrape client.
Sample record:

{
  "pk": "$spacecat#scrapejobid_ef09f4e4-c9cb-4a5f-b884-ac9b3ffa69a3",
  "sk": "$scrapejob_1",
  "abortInfo": {
    "reason": "bot-protection",
    "details": {
      "totalUrlsCount": 1,
      "blockedUrlsCount": 150,
      "blockedUrlsSampled": true,
      "blockedUrls": [
        {
          "url": "https://www.abbvie.com/science/areas-of-innovation/ai-and-data-convergence.html",
          "blockerType": "cloudflare",
          "httpStatus": 403,
          "confidence": 0.99
        }
      ],
      "byBlockerType": {
        "cloudflare": 150
      },
      "byHttpStatus": {
        "403": 150
      },
      "auditType": "meta-tags"
    }
  }
}
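
As a rough illustration of how a consumer might read such a record, the sketch below flattens `abortInfo` into a small stats object. Only the field names are taken from the sample record; the helper name `summarizeAbortInfo` and its return shape are assumptions made for this example, not code from this PR.

```javascript
// Hypothetical helper: field names mirror the sample record above, but the
// function name and the returned shape are illustrative assumptions.
function summarizeAbortInfo(abortInfo) {
  if (!abortInfo || abortInfo.reason !== 'bot-protection') {
    return null; // the job was not aborted because of bot protection
  }
  const details = abortInfo.details ?? {};
  const byBlockerType = details.byBlockerType ?? {};

  // Pick the blocker type with the highest blocked-URL count.
  const [topBlocker = 'unknown'] = Object.entries(byBlockerType)
    .sort(([, a], [, b]) => b - a)
    .map(([type]) => type);

  return {
    auditType: details.auditType ?? 'unknown',
    blockedUrlsCount: details.blockedUrlsCount ?? 0,
    totalUrlsCount: details.totalUrlsCount ?? 0,
    // When true, blockedUrls holds only a sample of the blocked URLs.
    sampled: Boolean(details.blockedUrlsSampled),
    topBlocker,
    sampleUrls: (details.blockedUrls ?? []).map((entry) => entry.url),
  };
}
```

Returning `null` for any other abort reason lets a caller skip alerting without inspecting the record itself.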

Debug Log:

"[BOT-BLOCKED] Bot protection detected" 

Tests:
https://cq-dev.slack.com/archives/C060T2PPF8V/p1770848981948789?thread_ts=1770846988.025919&cid=C060T2PPF8V

@tkotthakota-adobe tkotthakota-adobe marked this pull request as draft December 23, 2025 02:33
@github-actions

This PR will trigger no release when merged.

@codecov

codecov bot commented Dec 23, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.

tkotthakota-adobe commented Feb 7, 2026

@solaris007 Thanks for checking the IPs and adding both variables to Secrets Manager in all 3 environments. The review comments are addressed.

  • Do we use the stage environment in particular cases that I am not aware of?
  • Adjusted the tests to mock the right module; I missed this after moving the bot code to bot-detection.js.
  • The IPs were suggested by Cursor and I did not verify them, my bad. It looks like they were taken from tests, not from Terraform. I have now also verified them using the terraform command and agree with the IPs you posted. Sorry for the confusion.
  • log.debug was causing an issue in the shared lib, so I changed it to log.info.

@tkotthakota-adobe

@solaris007 What could be causing this deploy error? In my view, the changes made should not affect deployment.
Other services deployed fine.
https://github.com/adobe/spacecat-task-processor/actions/runs/21789518189/job/62866745175?pr=170

@tkotthakota-adobe

@solaris007 I noticed in the AWS console that the Lambda is being deployed. For some reason, the Jenkins deployer is showing the error I shared earlier.

@solaris007 solaris007 left a comment

Re-review - All requested changes addressed

The 4 issues from the previous review have all been fixed:

  • Mock paths now correctly target bot-detection.js (verified all esmock calls)
  • Dead re-export removed - cloudwatch-utils.js only exports getAuditStatus
  • The sortJobsByDate function and the sortedJobs result now have distinct names - no more name collision
  • Single bot-detection.test.js, no duplicate test file

Bot detection end-to-end flow is clean: handler gets jobId from scrape jobs, checkAndAlertBotProtection reads abortInfo via ScrapeClient, converts to stats, sends Slack alert. LGTM.
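
The flow described above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code merged in this PR: `scrapeClient.getScrapeJobStatus` and `sendSlackAlert` are hypothetical stand-ins injected by the caller, and the function is deliberately suffixed `Sketch` to keep it distinct from the real `checkAndAlertBotProtection`.

```javascript
// Sketch only. `scrapeClient.getScrapeJobStatus` and `sendSlackAlert` are
// hypothetical stand-ins supplied by the caller, not confirmed APIs.
async function checkAndAlertBotProtectionSketch({
  jobId, scrapeClient, sendSlackAlert, log,
}) {
  // 1. Read the job record; abortInfo is only present on aborted jobs.
  const job = await scrapeClient.getScrapeJobStatus(jobId);
  const abortInfo = job?.abortInfo;
  if (abortInfo?.reason !== 'bot-protection') {
    return null; // nothing to alert on
  }

  // 2. Convert abortInfo.details into flat stats.
  const details = abortInfo.details ?? {};
  const stats = {
    jobId,
    auditType: details.auditType ?? 'unknown',
    blockedUrlsCount: details.blockedUrlsCount ?? 0,
    totalUrlsCount: details.totalUrlsCount ?? 0,
    byBlockerType: details.byBlockerType ?? {},
  };

  // 3. Log at info level and send a single Slack alert.
  log.info(`[BOT-BLOCKED] Bot protection detected for job ${jobId}`);
  const blockers = Object.entries(stats.byBlockerType)
    .map(([type, count]) => `${type}: ${count}`)
    .join(', ');
  await sendSlackAlert(
    `Bot protection blocked ${stats.blockedUrlsCount} URL(s) during the `
    + `${stats.auditType} audit (${blockers || 'unknown blocker'})`,
  );
  return stats;
}
```

Passing the client, the Slack sender, and the logger as arguments keeps the sketch testable with plain stubs, without module-level mocking.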

Minor cleanup items

  1. Duplicate JSDoc on filterJobsByTimestamp - Two JSDoc blocks stacked on the same function. Remove the shorter one.

  2. Unnecessary esmock mock - bot-detection.test.js mocks fetchRecentThreadMessages from slack-utils.js, but bot-detection.js doesn't import it. Harmless but confusing.

  3. Gist tarball dependencies - 4 packages on gist tarballs. Must be replaced with published npm versions once #1308 merges.

  4. getScrapeJobsByBaseURL no longer filters by 'default' processing type - Intentional? This now returns jobs of all processing types. Just confirming this is desired behavior.
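
If the answer to item 4 is that some callers still need only 'default' jobs, one option is to filter on the caller side. The sketch below shows that alongside illustrative versions of the timestamp filter and date sort mentioned in this review. The job shape (`startedAt`, `processingType`) and all three function bodies are assumptions for illustration, not the implementations in this PR.

```javascript
// Hypothetical job shape: { id, startedAt: <ISO date string>, processingType }.

// Keep only jobs started at or after the given ISO timestamp.
function filterJobsByTimestamp(jobs, sinceIso) {
  const since = Date.parse(sinceIso);
  return jobs.filter((job) => Date.parse(job.startedAt) >= since);
}

// Return a new array sorted newest first; the input array is not mutated.
function sortJobsByDate(jobs) {
  return [...jobs].sort(
    (a, b) => Date.parse(b.startedAt) - Date.parse(a.startedAt),
  );
}

// Optional caller-side filter: with no processingType, all jobs pass through.
function filterJobsByProcessingType(jobs, processingType) {
  return processingType
    ? jobs.filter((job) => job.processingType === processingType)
    : jobs;
}
```

Naming the sorted result separately, for example `const sortedJobs = sortJobsByDate(jobs)`, avoids the function/result name collision raised in the earlier review.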

tkotthakota-adobe commented Feb 9, 2026

@solaris007 I addressed your comments. For some reason this PR is not getting deployed. A fresh test PR deployed fine, but a new PR created from this PR failed, which suggests that some state in this PR is causing the issue. I am running out of ideas. One other option is to cherry-pick the code into a new PR. This issue only started happening yesterday.

Update: Upgrading to the latest @adobe/helix-deploy (same as in the audit worker) solved the issue.

@tkotthakota-adobe tkotthakota-adobe merged commit 1741930 into main Feb 13, 2026
28 checks passed
@tkotthakota-adobe tkotthakota-adobe deleted the SITES-37727 branch February 13, 2026 01:38
solaris007 pushed a commit that referenced this pull request Feb 13, 2026
# [1.9.0](v1.8.2...v1.9.0) (2026-02-13)

### Features

* bot detection logic ([#170](#170)) ([1741930](1741930))
@solaris007

🎉 This PR is included in version 1.9.0 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀

2 participants